Goto

Collaborating Authors

 Emotion


'We May Have a Crisis on Our Hands': The Unregulated Rise of Emotionally Intelligent AI

TIME - Tech

'We May Have a Crisis on Our Hands': The Unregulated Rise of Emotionally Intelligent AI Pillay is an editorial fellow at TIME. Pillay is an editorial fellow at TIME. At least once a month, two-thirds of people who regularly use AI turn to their bots for advice on sensitive personal issues and emotional support. Many people now report trusting their chatbots more than their elected representatives, civil servants, faith leaders--and the companies building AI. That's according to data from 70 countries, gathered by the Collective Intelligence Project (CIP).


To Err Like Human: Affective Bias-Inspired Measures for Visual Emotion Recognition Evaluation

Neural Information Processing Systems

Accuracy is a commonly adopted performance metric in various classification tasks, which measures the proportion of correctly classified samples among all samples. It assumes equal importance for all classes, hence equal severity for misclassifications. However, in the task of emotional classification, due to the psychological similarities between emotions, misclassifying a certain emotion into one class may be more severe than another, e.g., misclassifying'excitement' as'anger' apparently is more severe than as'awe'. Albeit high meaningful for many applications, metrics capable of measuring these cases of misclassifications in visual emotion recognition tasks have yet to be explored. In this paper, based on Mikel's emotion wheel from psychology, we propose a novel approach for evaluating the performance in visual emotion recognition, which takes into account the distance on the emotion wheel between different emotions to mimic the psychological nuances of emotions. Experimental results in semi-supervised learning on emotion recognition and user study have shown that our proposed metrics is more effective than the accuracy to assess the performance and conforms to the cognitive laws of human emotions.


E 3 : Exploring Embodied Emotion Through A Large-Scale Egocentric Video Dataset

Neural Information Processing Systems

Understanding human emotions is fundamental to enhancing human-computer interaction, especially for embodied agents that mimic human behavior. Traditional emotion analysis often takes a third-person perspective, limiting the ability of agents to interact naturally and empathetically. To address this gap, this paper presents $E^3$ for Exploring Embodied Emotion, the first massive first-person view video dataset.


Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Neural Information Processing Systems

Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling.However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal.


Cross-Space Synergy: A Unified Framework for Multimodal Emotion Recognition in Conversation

Lyu, Xiaosen, Xiong, Jiayu, Chen, Yuren, Wang, Wanlong, Dai, Xiaoqing, Wang, Jing

arXiv.org Artificial Intelligence

Multimodal Emotion Recognition in Conversation (MERC) aims to predict speakers' emotions by integrating textual, acoustic, and visual cues. Existing approaches either struggle to capture complex cross-modal interactions or experience gradient conflicts and unstable training when using deeper architectures. To address these issues, we propose Cross-Space Synergy (CSS), which couples a representation component with an optimization component. Synergistic Polynomial Fusion (SPF) serves the representation role, leveraging low-rank tensor factorization to efficiently capture high-order cross-modal interactions. Pareto Gradient Modulator (PGM) serves the optimization role, steering updates along Pareto-optimal directions across competing objectives to alleviate gradient conflicts and improve stability. Experiments show that CSS outperforms existing representative methods on IEMOCAP and MELD in both accuracy and training stability, demonstrating its effectiveness in complex multimodal scenarios.


Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention

Wang, Cong, Geng, Yizhong, Wen, Yuhua, Li, Qifei, Gao, Yingming, Wang, Ruimin, Wang, Chunfeng, Li, Hao, Li, Ya, Chen, Wei

arXiv.org Artificial Intelligence

Speech emotion recognition (SER) is an important technology in human-computer interaction. However, achieving high performance is challenging due to emotional complexity and scarce annotated data. To tackle these challenges, we propose a multi-loss learning (MLL) framework integrating an energy-adaptive mixup (EAM) method and a frame-level attention module (FLAM). The EAM method leverages SNR-based augmentation to generate diverse speech samples capturing subtle emotional variations. FLAM enhances frame-level feature extraction for multi-frame emotional cues. Our MLL strategy combines Kullback-Leibler divergence, focal, center, and supervised contrastive loss to optimize learning, address class imbalance, and improve feature separability. We evaluate our method on four widely used SER datasets: IEMOCAP, MSP-IMPROV, RAVDESS, and SAVEE. The results demonstrate our method achieves state-of-the-art performance, suggesting its effectiveness and robustness.


Learning What to Attend First: Modality-Importance-Guided Reasoning for Reliable Multimodal Emotion Understanding

Rha, Hyeongseop, Yeo, Jeong Hun, Won, Junil, Park, Se Jin, Ro, Yong Man

arXiv.org Artificial Intelligence

In this paper, we present Modality-Importance-Guided Reasoning (MIGR), a framework designed to improve the reliability of reasoning-based multimodal emotion understanding in multimodal large language models. Although existing methods have advanced emotion understanding, they often suffer from reasoning drift: models gradually rely on their own generated text instead of multimodal evidence, and their explanations are overly shaped by visually initiated reasoning paths. To address these issues, we introduce Modality Importance (MI), a simple yet effective mechanism for identifying the emotion-dominant modality. Using MI, MIGR reorganizes reasoning sequences so that explanations begin from the modality most critical to the target emotion, preventing early reasoning from being misled by less informative cues. Our two-stage framework-comprising modality-aligned supervised fine-tuning and modality-aware reward optimization-encourages models to generate emotionally grounded, causally relevant, and coherence-preserving explanations. Experimental results on the DFEW benchmark show that MIGR substantially improves reasoning reliability, decreasing instances of correct predictions accompanied by emotionally inconsistent explanations from 18.10% to 7.37%. These results confirm the benefit of initiating reasoning from the emotion-dominant modality.


Story2MIDI: Emotionally Aligned Music Generation from Text

Shokri, Mohammad, Salem, Alexandra C., Levine, Gabriel, Devaney, Johanna, Levitan, Sarah Ita

arXiv.org Artificial Intelligence

Abstract--In this paper, we introduce Story2MIDI, a sequence-to-sequence Transformer-based model for generating emotion-aligned music from a given piece of text. T o develop this model, we construct the Story2MIDI dataset by merging existing datasets for sentiment analysis from text and emotion classification in music. The resulting dataset contains pairs of text blurbs and music pieces that evoke the same emotions in the reader or listener . Despite the small scale of our dataset and limited computational resources, our results indicate that our model effectively learns emotion-relevant features in music and incorporates them into its generation process, producing samples with diverse emotional responses. We evaluate the generated outputs using objective musical metrics and a human listening study, confirming the model's ability to capture intended emotional cues. We live in a world with an ever-growing demand for entertainment and multimedia content. The rise of social media and platforms for music, audio-books, and podcasts has gained tremendous momentum. At the heart of many of these forms of entertainment lies a narrative, a story that drives the experience, whether in a film, a game, a podcast, or a documentary.


EQ-Negotiator: Dynamic Emotional Personas Empower Small Language Models for Edge-Deployable Credit Negotiation

Long, Yunbo, Liu, Yuhan, Brintrup, Alexandra

arXiv.org Artificial Intelligence

The deployment of large language models (LLMs) in automated negotiation has set a high performance benchmark, but their computational cost and data privacy requirements render them unsuitable for many privacy-sensitive, on-device applications such as mobile assistants, embodied AI agents or private client interactions. While small language models (SLMs) offer a practical alternative, they suffer from a significant performance gap compared to LLMs in playing emotionally charged complex personas, especially for credit negotiation. This paper introduces EQ-Negotiator, a novel framework that bridges this capability gap using emotional personas. Its core is a reasoning system that integrates game theory with a Hidden Markov Model(HMM) to learn and track debtor emotional states online, without pre-training. This allows EQ-Negotiator to equip SLMs with the strategic intelligence to counter manipulation while de-escalating conflict and upholding ethical standards. Through extensive agent-to-agent simulations across diverse credit negotiation scenarios, including adversarial debtor strategies like cheating, threatening, and playing the victim, we show that a 7B parameter language model with EQ-Negotiator achieves better debt recovery and negotiation efficiency than baseline LLMs more than 10 times its size. This work advances persona modeling from descriptive character profiles to dynamic emotional architectures that operate within privacy constraints. Besides, this paper establishes that strategic emotional intelligence, not raw model scale, is the critical factor for success in automated negotiation, paving the way for effective, ethical, and privacy-preserving AI negotiators that can operate on the edge.


Echo-N1: Affective RL Frontier

Zhang, Naifan, Sun, Ruihan, Su, Ruixi, Ma, Shiqi, Zhang, Shiya, Weng, Xianna, Zhang, Xiaofan, Zhan, Yuhan, Xu, Yuyang, Chen, Zhaohan, Pan, Zhengyuan, Song, Ziyi

arXiv.org Artificial Intelligence

The LLM field has spent a year perfecting RL for tasks machines already excel at, math, code, and deterministic reasoning, while completely sidestepping the domain that actually defines human intelligence: subjective, emotionally grounded, personality sensitive conversation. This space has often been regarded as inherently subjective and challenging to formalize, making it appear unsuitable for conventional RL pipelines. We show that it is not only possible and it is a solvable and transformative RL problem. We propose the first framework that infers user personality on the fly and optimizes model behavior toward personalized conversational preferences. Contrary to the widespread belief that RL collapses in non-verifiable settings, our method produces consistent, robust, and dramatic improvements in humanlike interaction quality. We also introduce the first dynamic emotional intelligence evaluation suite to quantify these gains. Our model, which is introduced as Echo-N1, behaves far above its base version and outperforming the proprietary Doubao 1.5 Character. This work establishes a new frontier for RL: optimizing models for the deeply subjective, deeply human dimensions of conversation.